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Abstract 

Background: In population association studies, standard methods of statistical inference assume that study subjects 
are independent samples. In genetic association studies, it is therefore of interest to diagnose undocumented close 
relationships in nominally unrelated study samples. 

Results: We describe the R package CrypticlBDcheck to identify pairs of closely-related subjects based on genetic 
marker data from single-nucleotide polymorphisms (SNPs). The package is able to accommodate SNPs in linkage 
disequibrium (LD), without the need to thin the markers so that they are approximately independent in the 
population. Sample pairs are identified by superposing their estimated identity-by-descent (IBD) coefficients on plots 
of IBD coefficients for pairs of simulated subjects from one of several common close relationships. 

Conclusions: The methods implemented in CrypticlBDcheck are particularly relevant to candidate-gene 
association studies, in which dependent SNPs cluster in a relatively small number of genes spread throughout the 
genome. The accommodation of LD allows the use of all available genetic data, a desirable property when working 
with a modest number of dependent SNPs within candidate genes. CrypticlBDcheck is available from the 
Comprehensive R Archive Network (CRAN). 
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Background 

It is well known that the results of genetic association 
studies may be confounded by the presence of undoc- 
umented relationships - a phenomenon referred to as 
cryptic relatedness (e.g., [1,2]). For example, [3] found that 
tests of association between genetic markers and quan- 
titative phenotypes such as serum LDL tended to have 
inflated significance when relationships between individ- 
uals from a large Hutterite kindred were not accounted 
for. Before making any inference with the data, it is there- 
fore important to understand cryptic relatedness in the 
study sample. To facilitate this understanding, we intro- 
duce CrypticlBDcheck, an R [4] package for exploring 
the presence of close relationships in a homogeneous 
sample of nominally unrelated individuals. Although sev- 
eral methods for exploring cryptic relatedness have been 
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implemented (reviewed below), none are geared for data 
from candidate-gene association studies. CrypticlBD- 
check fills this need. For ease of interpretation, the pack- 
age implements exploratory displays based on popular 
measures of gene-identity by descent. However, a unique 
feature of these displays is that they accommodate popu- 
lation linkage disequilibrium (LD) amongst genetic mark- 
ers. The accommodation of LD allows the use of data on 
all available markers rather than on a subset whose alle- 
les are approximately independent in the population. This 
feature is attractive in candidate-gene association studies, 
where markers within genes are in LD but the number 
of genes is too small to select an independent subset of 
markers that is informative for relationship. Using the 
simulated data set analyzed in the Examples section we 
have found it possible to distinguish parent-offspring or 
full sibling pairs from unrelated individuals using as few 
as 60 candidate genes (average of five SNPs per gene). We 
return to the issue of how many SNPs are appropriate 
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for analysis with CrypticIBDcheck in the Conclusions 

section. 

The relatedness between two individuals may be defined 
in terms of the proportion of loci at which they share zero, 
one or two alleles that are identical-by-descent (IBD). 
We refer to these proportions as the actual IBD-sharing 
coefficients, or IBD coefficients. The alleles from two indi- 
viduals are IBD if they are descended from a common 
ancestor in a given reference population (e.g., [5]). Though 
alleles from each of two individuals may match or be 
identical-by-state (IBS), they are not necessarly IBD. 

CrypticIBDcheck uses estimated IBD coefficients to 
summarize possible relationships among pairs of study 
subjects. The approach is exploratory and graphically- 
based, similar to the GRR approach of [6] and the 
approach of [7] implemented in the ibdPlot ( ) function 
of the GWASTools Bioconductor package. 

GRR calculates and displays the mean and variance of 
IBS allele sharing over polymorphic loci for each pair 
of individuals. Pairs of known relationships form refer- 
ence clusters on the plot, allowing the user to identify 
errors in reported relationships. In association studies of 
nominally unrelated individuals, however, there are no 
reference clusters available. In principle, reference clusters 
could be obtained theoretically from the joint distribu- 
tion of the IBS mean and variance estimators, but it is 
unclear how to derive this distribution in the presence 
of LD. 

The ibdPlot ( ) function in GWASTools may be 
applied to view estimated IBD coefficients along with ref- 
erence clusters for the unobserved, true IBD coefficients 
based on theoretical moments of their distribution [8]. 
However, in candidate-gene studies with a modest num- 
ber of single-nucleotide polymorphisms (SNPs), errors 
introduced by estimation of IBD coefficients cannot be 
ignored. Hence, reference distributions for the true IBD 
coefficients do not adequately represent those for esti- 
mated IBD coefficents. 

The idea behind CrypticIBDcheck is to identify 
closely-related study pairs by displaying their estimated 
IBD coefficients together with those from simulated pairs 
of known relationships. The simulated reference pairs 
provide an empirical joint distribution of the IBD estima- 
tors under selected relationships which, in turn, suggest 
possible relationships amongst study pairs. Working with 
simulated pairs from known relationships avoids having 
to derive the joint distribution of the IBD estimators 
when the genetic markers are in LD. Simulated pairs are 
obtained by gene drop on a relationship-specific pedi- 
gree, with pedigree-founder haplotypes drawn from a 
fitted haplotype model that accounts for LD [9]. We have 
implemented simulation of the following common rela- 
tionships: monozygotic twins/sample duplicates, parent- 
offspring, full siblings, second degree (i.e., half siblings, 



avuncular or grandparent-grandchild) and first cousins. 
However, users may also specify their own custom rela- 
tionships (see the Examples section). 

The paper is structured as follows. In the Methods 
section we describe the IBD estimators and methods for 
gene drop simulation in the presence of LD to obtain 
reference clusters. The Results and Discussion section 
describes implementation details provides two examples 
of how to use the package. The paper ends with a 
Conclusions section that includes ideas for future work. 

Methods 

IBD estimation 

There are two common approaches to estimating IBD 
coefficients: maximum likelihood [10-12] and the method 
of moments [13-15]. Typically, maximum-likelihood esti- 
mators (MLEs) are more biased than method-of-moments 
estimators (MMEs), especially when the number of loci 
is small; they are also more computationally expensive 
[14]. However, MMEs are less precise than MLEs and 
can fall outside the biologically meaningful parameter 
space [11]. 

In this section, we review a popular method of moments 
approach to estimating IBD coefficients introduced by 
[15] and implemented in PLINK. This approach assumes 
that the individuals are from the same homogeneous, 
random-mating and non-inbred population. Alleles from 
two individuals are considered to be IBD if they are 
descended from a common ancestor in some base popu- 
lation that we take to be relatively recent. All alleles in this 
base population are defined to be non-IBD. Given a SNP 
with alleles A and a, a pair of individuals that are, say, AA 
and aa, respectively, will be denoted (AA, ad). 

Identity-by-state (IBS) for a pair of subjects is denoted 
by the random variable / and identity-by-descent by the 
random variable Z, with possible states being 0, 1, and 2 
for both random variables. The IBD coefficients to be esti- 
mated are the proportions of genome shared IBD, denoted 
by P(Z = 0), P(Z = 1), and P(Z = 2). For a given SNP m, 
the procedure begins by expressing the prior probability 
of IBS sharing as 

i 

P m (I = i) = J2 p m(I = i\Z = z)P(Z = z). (1) 

z=0 

P(Z = z) and P m {I = i) are specific to the pair of sub- 
jects being considered, while the conditional SNP-specific 
IBS probabilities P m (I = i\Z = z) apply to all pairs. 
For a given pair of individuals at a given SNP, the above 
equation specifies three identities for the IBS states 0, 1, 
and 2. These three identities are summed over SNPs and 
then rearranged to express P(Z = 0), P(Z = 1), and 
P(Z = 2) for the pair in terms of marginal and conditional 
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IBS probabilities. For example, in the case of i = z = 0, we 
obtain 

P(Z = 0) = = 0)/J2 p m(I = o\z = 0). 



The method-of-moments estimators of IBD coefficients 
for a given pair are obtained by substituting estimators 
of the conditional SNP-specific IBS probabilities, P m (I = 
i\Z = z), pertaining to any pair and the pair's marginal 
SNP-specific IBS probabilities, P m (I = i), into the identi- 
ties and then solving for P(Z = i). 

The marginal SNP-specific IBS probabilities, P m (I = i) 
for a pair of subjects may be estimated by the indica- 
tor function for whether the pair has / = i at the SNP 
An unbiased estimator of ^l m P m (I = 0 is therefore the 
count of SNPs at which the pair shares i alleles IBS. Esti- 
mates of the SNP-specific conditional IBS probabilities, 
P m (I = i\Z = z), are based on data from all subjects in 
the sample. Derivation of unbiased estimators of P m (I = 
i\Z = z) is more involved. To simplify notation, we tem- 
porarily drop the SNP subscript m. If p and q = 1 — p 
denote the frequencies of A and a in the base population, 
then P(I = i\Z = z) is a function of p and q. For example, 
two individuals share 0 alleles IBS if they are either (AA, 
ad) or {AA, ad). Given that Z = 0, the probabilities of 
these genotypes are p 2 q 2 and q 2 p 2 , respectively, leading to 
P(I = 0|Z = 0) = 2p 2 q 2 . The plug-in estimators of condi- 
tional IBS probabilities, such as P{I = 0|Z = 0), obtained 
by inserting estimators p and q are biased [Additional file 
1]. Unbiased estimators, expressed as the plug-in estima- 
tor multiplied by a correction factor, may be derived as 
described next. 

Let X and Y be the counts of the alleles A and a, respec- 
tively, so that the allele frequency estimators are p = X/T 
and q = Y/T, where T is twice the number of observed 
genotypes in the population random sample. The estima- 
tors of the conditional IBS probabilities P(I = i\Z = z) 
may be motivated by the following model. The genotype of 
each individual in the present population is obtained from 
two independent draws from an infinite base population 
of alleles. Consequently, the T alleles of a population ran- 
dom sample of study subjects can be viewed as a random 
sample from the base population. Moreover, conditional 
on IBD status, any pair of individuals in the present popu- 
lation can be viewed as independent allelic draws from the 
base population, with the number of draws determined by 
their IBD status. 

For example, in the case of Z = 0, a random pair of 
individuals results from randomly drawing two pairs of 
alleles from the base population. An indicator variable of 
whether this sampling process results in / = 0 is an unbi- 
ased estimator of P(I = 0|Z = 0). An unbiased estimator 
is therefore the average of these indicator variables over all 
possible draws from the T alleles on which we have data; 



i.e., the proportion of pairs of allelic pairs with 1 = 0. The 
proportion can be computed as follows. The number of 
ways of selecting four distinct alleles from a total of T is 
T(T - 1)(T - 2)(T - 3). Without loss of generality, sup- 
pose the first two alleles are assigned to the first individual 
in a pair and the last two alleles to the second individual. 
Then the number of pairs that are (AA, ad) and {AA, ad) 
areX(X- \)Y(Y -\) and Y(Y- 1)X(X- 1), respectively. 
Hence, 

2X(X- l)Y(Y - 1) 

P(I = 0|Z = 0) = — — , 2) 

T(T - 1)(T - 2)(T - 3) 

is an unbiased estimator oiP(I = 0|Z = 0) (see Additional 
file 1 for verification by direct computation). After algebra, 
the unbiased estimator may be expressed in terms of the 
allele frequency estimators and a correction factor as: 



>(/ = 0|Z = 0) =2p 2 q 2 



(X-\ Y-l T 

( x x 

V X Y T-l 



T T 

x x 



T-2 T-3, 



For Z = 1, we consider a pair of individuals to be the 
result of drawing three alleles from the base population, 
one of which is shared by the pair of individuals. The pro- 
portion of such pairs of individuals with IBS state / = 1 in 
our data is an unbiased estimator of P(I = 1|Z = 1). The 
number of ways to select three distinct alleles from a total 
of Tis T(T — 1)(T — 2). Among these, the genotype pairs 
that are / = 1 are the X(X - 1)7, YX(X - 1), Y(Y - l)X, 
and XY(Y — 1) that are (AA, ad), (AA, ad), (AA, ad), and 
(AA, ad), respectively. Thus, 



P(I= 1|Z= 1) 



2X(X- l)Y + 2XY(Y - 1) 
T(T-l)(T-2) 

2XYX fX -I T T 

x x 



TT 2 \ X T-l T-2 

2XY 2 fX -I T T 

x x 



TT 2 \ X T-l T-2 

~/X-l T T \ 

= 2^^— x — x— j 

Y^x^x^V 
V X T-l T-2) 



+ 2pq z 



The other conditional IBS probabilities are estimated in 
an analogous manner and their expressions are provided 
in Table one of [15]. 
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With estimates P m (I = i) and P m (I = i\Z = z) for each 
SNP, we sum over SNPs to obtain estimates of the IBD 
coefficients for a given pair in the sample. Let 

L 

N(I = i\Z = z) = Pmil = i\Z = z) and 

m=l 
L 

N(I=i) = J2 P m(I = 0, (3) 

m=l 

where L is the total number of SNPs with genotype data 
on both individuals. For any pair of subjects, summing 
equation (1) over all the SNPs and using equation (3) gives 
the following method-of-moment estimators of the IBD 
coefficients: 

MI = o) 

i>(Z = 0) = ^ — v — 

M(I = 0|Z = 0) 

p = W = D-HZ = 0) xW= 1|Z = 0) 

AT(/=1|Z = 1) 

hl r_^_W = 2)-P(Z = 0)xN'I = 2\Z=0)-P(Z = l)xN(I=2\Z = l) 
1 \A — 2) — ~ • 

Af(/=2|Z=2) 

Adjustments to bound these estimators to values con- 
sistent with their interpretation as IBD proportions were 
proposed in [15]. We have not made these adjustments in 
our graphical displays. 

Gene drop simulation with LD 

The package provides a graphical display that can be used 
to identify related sample pairs by plotting the estimated 
IBD coefficients P(Z = 1) versus P(Z = 0). To assess 
the variability of these estimators the points of the IBD 
plot are superposed on reference clusters obtained from 
one of the following relationships: unrelated, monozy- 
gotic twins/duplicates, parent-offspring, full siblings, half 
siblings and first cousins. These reference clusters are 
obtained by gene drop simulation that accounts for LD 
[16]. A strength of this approach is that we do not need 
to assume independence of marker loci. In candidate-gene 
association studies, this feature is important because of 
the dependence among a relatively small number of SNPs. 
Ignoring the dependence among SNPs within genes pro- 
duces reference clusters that are too tight relative to the 
true variability, and can lead to false-positive results. We 
return to this point in the examples. 

A graphical model is an approach to modeling the joint 
distribution of a set of dependent random variables when 
many independences or conditional independences exist 
between subsets of the variables. In the case of LD, it is 
expected that the joint distribution of alleles allong hap- 
lotypes shows such a structure. A flexible graphical model 
of haplotype frequencies that captures LD between loci 



is described in [17]. The model is fit to data from sub- 
jects that can be regarded as a population random sample; 
e.g., the controls in a case-control study of a rare disease. 
Model parameters are estimated by use of a stochastic 
optimization algorithm [16]. 

Once the LD model is fit, it is used to sample haplotypes 
for the founders of a pedigree. Data on the remaining 
members of the pedigree are simulated by gene drop. 
Gene drop is a method for randomly generating the 
genotypes of related individuals in a pedigree. Alleles 
are "dropped" from the founders through the pedigree 
according to Mendel's laws. Multi-locus gene drop incor- 
porates the process of recombination. To illustrate the 
simulation procedure, consider a parent-offspring rela- 
tionship. A pedigree that encompasses this relationship 
is one comprised of two parents and the offspring. The 
founders are the parents. Parental haplotypes are simu- 
lated from the fitted LD model and are then dropped to 
the offspring. To mimic real data with missing genotypes, 
selected genotypes for a simulated individual are set to 
missing according to the missing genotype pattern of a 
randomly-sampled study subject. 

Programs for fitting LD models and performing gene 
drop simulations are available in the Java Programs 
for Statistical Genetics and Computational Statistics 
(JPSGSC) library developed by Alun Thomas (http:// 
balance.med.utah.edu/wiki/index.php/JPSGCS). We use 
the R package rJPSGCS [18] to access these programs 
from R. 

Results and Discussion 

Implementation 

The main function in CrypticIBDcheck is IBDcheck ( ) , 
which estimates IBD coefficients for pairs of study sub- 
jects and optionally for simulated pairs of subjects and 
returns an object of class IBD. The plot method for 
the IBD class displays the IBD coefficients for pairs of 
study subjects, along with prediction ellipses for known 
relationship pairs. 

The arguments of IBDcheck () are constructed by 
the functions new.IBDO, filter . control ( ) and 
sim. control ( ) . The function new. IBD produces an 
object of class IBD suitable for input to IBDcheck (). 
At a minimum, such an object includes the genetic data 
as a snp. matrix object from the chopsticks package 
[19], a data frame of SNP information that includes chro- 
mosome and physical map positions of each SNP, and 
a data frame of subject information that includes a log- 
ical vector indicating whether (TRUE) or not (FALSE) 
each subject is to be used to estimate the conditional IBS 
probabilities and fit the LD model. Optionally, the SNP 
information may include genetic map positions, in centi- 
Morgans. If genetic map positions are missing, they are 
inferred assuming the physical positions are from build 
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36 of the human genome. Users with SNP data on diploid 
non-human organisms, such as mouse or drosophila, 
must supply their own genetic map positions. The docu- 
mentation for new . IBD ( ) and the examples below pro- 
vide further details. The function filter . control ( ) 
sets options for quality control filtering of data by SNPs 
and by subjects, while the function sim . control ( ) 
sets options that control simulation of subjects by gene 
drop. The respective help files and the examples below 
provide further details. As the fitting of LD models 
in IBDcheck ( ) can be computationally demanding, 
users have the option of splitting computations across 
a snow cluster [20], as described in Additional file 2. 
The output of IBDcheck () is an object of class IBD, 
which includes the estimated IBD coefficients for pairs 
of study subjects and for simulated pairs with known 
relationship. 

IBD objects are graphically displayed by the plot 
method of the class; the documentation for this method 
is available through help ( ' 'plot . IBD' ' ) . Plots are 
of P(Z = 1) versus P(Z = 0) for pairs of study sub- 
jects, with prediction ellipses for known relationships 
superposed, if requested by the user. The prediction 
ellipses are produced from estimated IBD coefficients for 
a user-specified number (default 200) of simulated pairs 
of known relationships, assuming the distribution of esti- 
mated IBD coefficients is approximately bivariate Normal. 
The default setting for IBDcheck () is to omit simu- 
lated pairs from the object. When simulated pairs are 
omitted, plotting produces a single interactive display of 
estimated IBD coefficients for pairs of study subjects, on 
which points may be identified by clicking with the mouse. 
On the other hand, when the IBD object includes simu- 
lated pairs, the function returns a series of plots, which the 
user is prompted to view and interact with successively. 
The first plot to appear is non-clickable and shows the 
estimated IBD coefficients for all pairs of study subjects, 
along with the prediction ellipse for unrelated, simulated 
pairs. Subsequent plots are clickable and correspond to 
each relationship requested in the call to IBDcheck ( ) . 
These relationship-specific plots are for identifying pairs 
of study subjects which could have the relationship. The 
plotting regions are restricted to the neighborhood of 
the prediction ellipse for the simulated pairs of that rela- 
tionship, which is also drawn. If, however, the plotting 
region overlaps with the prediction ellipse for simulated 
unrelated pairs, the ellipse for simulated unrelated pairs 
is drawn as well. Points falling within the prediction 
ellipse for the relationship and outside the prediction 
ellipse for unrelated pairs are automatically flagged. In 
addition, users may click on points of study pairs that 
appear to be related but are not automatically flagged. 
The plot method produces a data frame of information on 
pairs that have been flagged on the different plots, either 



automatically or interactively by the user through clicking 
the mouse. 

Examples 

We illustrate the features of the CrypticIBDcheck 
package using the genetic data Nhlsim that comes with 
the package. These data were simulated to mimic the 
characteristics of SNP genotypes in subjects of European 
ancestry from a candidate-gene, case-control study of 
non-Hodgkin Lymphoma [21]. The data set is a list com- 
prised of (i) a snp . matrix object called snp . data with 
genotypes for 108 controls and 100 cases; (ii) a vector 
chromosome of chromosome numbers for each SNP; (iii) 
a vector physmap of physical map positions of each SNP, 
from build 36 of the human genome; and (iv) a binary vec- 
tor csct with value one for cases and zero for population 
controls. The binary vector csct is used to select controls 
for fitting LD models and estimating conditional IBS prob- 
abilities. All of the information in Nhlsim is required to 
run IBDcheck ( ) . 

We present two examples. In the first (Default anal- 
ysis with LD model fitting and gene drops), we illus- 
trate basic use of IBDcheck ( ) to fit LD models and 
do gene drop simulations. Once the user requests sim- 
ulations, there are a number of parameters, such as the 
types of relationships to simulate, that control the sim- 
ulations. Each simulation parameter has a default value, 
as described in the help file for sim. control ( ) . In 
the first example we use these default settings. In the 
second example (Additional gene drops using previous- 
ly-fit LD models), we illustrate re-use of fitted LD models 
to perform additional gene drop simulations, this time 
for a user-specified relationship. For examples of how to 
use IBDcheck ( ) to explore genome-wide data, we refer 
readers to the package vignette IBDcheck-hapmap that 
illustrates an analysis of genome-wide data from HapMap, 
using thinning of markers to reduce the computational 
burden. 

Default analysis with LD model fitting and gene drops 

We first load the package and the Nhlsim data set. 

R> library ( 1 1 CrypticIBDcheck' ' ) 

Loading required package: rJPSGCS 

Loading required package: rJava 

Loading required package: chopsticks 

Loading required package: survival 

Loading required package: splines 

Loading required package: car 

Loading required package: MASS 

Loading required package: nnet 

Loading required package: ellipse 
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Attaching package: 'ellipse' 

The following object (s) are masked from 
'package : car' : 

ellipse 

R> data ( "Nhlsim" ) 

Next we create an object of class IBD that can be used 
as input to the IBDcheck ( ) function. The Nhlsim data 
does not include genetic map positions for the SNPs, so 
these will be inferred from the physical positions, assum- 
ing the physical positions are from build 36 of the human 
genome. We use subjects with case-control status 0 (con- 
trols) for estimating conditional IBS probabilities and 
fitting LD models. 

R> popsam <- Nhlsimc$sct==0 

R> dat <- new . IBD (Nhlsim$snp . data, 

Nhlsim$ chromosome , Nhlsim$physmap, popsam) 

Note: Input does not include genetic map 

locations (Gen_loc) . 
Inferring genetic map from physical 

position (Position) , assuming build 36 of 

the human genome . 

Note: Using population sample subjects 
(popsam==TRUE) to fill in pvalues from 
tests of HWE. 

Note: Input does not include subject ids 
(subids) . Using rownames of snp.data. 

In this illustration, we leave all QC filtering options 
(set by f ilter . control () ) at their default values. 
We use sim . control ( ) to modify the default value of 
simulate = FALSE to simulate=TRUE, so that refer- 
ence clusters are simulated. 

R> ss <- sim. control (simulate=TRUE) 
R> cibd <- IBDcheck (dat , simparams=ss ) 

This call to IBDcheck ( ) will generate 22 plain-text 
files in the user's working directory that contain details of 
the fitted LD models for each of the 22 autosomal chromo- 
somes. The names of these files are stored in the output 
IBD object: 

R> cibd$simparams$LDf iles 

[1] "GDI. Id. par" "GD2.1d.par" "GD3.1d.par" 
"GD4.ld.par" "GD5.1d.par" 



[6] 
[11] 

[16] 

[21] 



GD6.ld.par" "GD7.1d.par" "GD8.1d.par" 
GD9.ld.par" "GD10 . Id . par " 
GD11 . Id. par" 



GD13 . Id. par" 
GD15 . Id. par" 
GD16 . Id. par" 
GD18 . Id. par" 
GD20 . Id. par" 
GD21 . Id. par" 



"GD12 . Id. par" 
"GD14 . Id. par" 

"GD17 . Id. par" 
"GD19 . Id. par" 

"GD22.ld.par"} 



The section Additional gene drops using previous- 
ly-fit LD models gives an example of how to re-use 
these fitted LD models for performing additional gene 
drops. The output includes estimated IBD coefficients 
for pairs of subjects in the input data and for simu- 
lated pairs of subjects from the following relationships: 
unrelated, duplicates/MZ twins, parent-offspring, full sib- 
lings and half-siblings. Simulation of first-cousin or user- 
specified relationship pairs is also possible, but is not 
done by default. First cousins are typically not distinguish- 
able from unrelated pairs with data from a candidate- 
gene association study. The estimated IBD coefficients 
can be plotted with the plot method for the IBD 
class. 



R> ibdpairs = plot (cibd) 
R> ibdpairs 

memberl member2 pzO pzl relationship 

1 subl sub201 0.0000000 1.0170209 parent-offspring 

2 sub2 sub202 0.0000000 1.0049532 parent-offspring 

3 sub203 sub206 0.1941393 0.5161277 full sibs 

4 sub204 sub207 0.2294063 0.5217202 full sibs 

5 sub205 sub208 0.2057583 0.4434503 full sibs 

6 sub31 sub44 0.5264858 0.4505440 half sibs 



In this example, the plotting function produces five 
plots, shown in Figures 1, 2, and 3, and an output data 
frame ibdpairs that contains information on study 
pairs flagged with the last four plots in Figures 2 and 3. 

Figure 1 shows the non-clickable plot of the estimated 
IBD coefficients, P(Z = 1) versus P(Z = 0), for all pairs 
of study subjects, with the prediction ellipse for unrelated 
pairs superposed. The level of the prediction ellipse is left 
at its default value of ellipse . coverage=0 . 95 and, 
for unrelated pairs, is adjusted to account for the majority 
of study pairs being unrelated. Specifically, a Bonferroni- 
type adjustment, 1 — (1 — ellipse . coverage)/^, 
is applied, where n p is the number of pairs of study 
subjects. One purpose of the prediction ellipse is to avoid 
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pzO 

Figure 1 IBD coefficients for all pairs. Estimated IBD coefficients for 
all pairs of study subjects, with the prediction ellipse for unrelated 
pairs superposed. 



confusing the display by adding points for simulated pairs. 
Another purpose is to avoid having to manually click 
points for study pairs that appear within a cloud of points 
from simulated pairs. We adopted a bivariate normal 
approximation to the prediction ellipse because it cor- 
rectly identified the majority of points in experiments with 
simulated data (e.g., Figure 1). However, in Figure 1, sev- 
eral unrelated pairs appear outside the prediction ellipse, 



indicating that the distribution of estimated IBD coeffi- 
cients is slightly heavier-tailed than the bivariate normal 
approximation. 

For the four other plots, shown in Figures 2 and 3, 
points that lie within a 95% prediction ellipse (the default 
level for ellipse . coverage) for the given relationship 
and outside the prediction ellipse for unrelated pairs are 
automatically flagged. In addition, these plots are click- 
able, and points flagged manually are added to the output 
dataframe. For example, on the plot for half-siblings, the 
point corresponding to the pair sub3 5 and sub95 has 
been manually flagged (Figure 3, right panel); this pair 
appears in both the prediction ellipse for unrelated pairs 
and the upper portion of the prediction ellipse for half sib- 
lings. Manually clicking on the point for this pair adds the 
following row to the output dataframe ibdpairs: 

7 sub35 sub95 0.4568146 0.6351182 half sibs 

In this data set, there are no duplicate/MZ twins pairs 
and no pairs flagged as such (Figure 2, left panel). The 
two parent-offspring pairs in the Nhlsim data fall in 
the prediction ellipse for parent-offspring pairs (Figure 2, 
right panel). Similarly, all three full-sibling pairs in the 
Nhlsim data fall in the prediction ellipse for full siblings 
(Figure 3, left panel). The substantial overlap of the predic- 
tion ellipses for half siblings and unrelated pairs (Figure 3, 
right panel) indicates insufficient data to distinguish these 
two relationships. Though there are no half- sibling pairs 
in Nhlsim, one pair of unrelated subjects, sub31 and 
sub44, has atypical estimated IBD coefficients that fall 



MZtwins/duplicates 



parent-offspring 




"T" 




0.00 0.05 
pzO 



0.00 0.05 
pzO 



Figure 2 Possible MZ twins/duplicates and parent-offspring pairs. Observed pairs with prediction ellipses for MZ twins/duplicates (left panel) 
and parent-offspring pairs (right panel) superposed. 
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within the prediction ellipse for half siblings but outside 
the prediction ellipse for unrelated pairs. 

The unrelated pair flagged as a potential half-sibling pair 
is a false-positive result. We observed (results not shown) 
that the number of false-positive related pairs is greatly 
increased if we fail to take the LD between SNPs into 
account. Specifically, if we repeat the simulation of unre- 
lated and half-sibling pairs of subjects assuming indepen- 
dent SNPs (f it LD= FALSE), we obtain 16 false-positive 
half sibling pairs. These observations highlight that 
naively ignoring the dependence among SNPs produces 
reference clusters that are too tight relative to the true 
variability. 

Additional gene drops using previously-fit LD models 

By far the most computationally-demanding step of 
IBDcheck ( ) is the fitting of LD models. The fitted LD 
models are stored in plain-text files in the working direc- 
tory and can be re-used for future gene drops using the 
argument LDfiles of sim . control () , as we now 
illustrate. We also demonstrate how users can create their 
own relationships to use as reference clusters on the IBD 
plot. 

Setting of simulation parameters, such as the names 
of fitted LD model files and specification of the rela- 
tionships to simulate, is done with the sim . control ( ) 
function. Recall that the names of the LD files are stored 
in the IBD object created by a call to IBDcheck ( ) ; 
for example: 



R> cibd$simparams$LDf iles 



[1] 



[6] 



[11] 



[16] 



[21] 



"GDI 
"GD3 
"GD5 
"GD6 
"GD8 
"GDI 
"GD11 
"GD13 
"GD15 
"GDI 6 
"GDI 8 
"GD2 0 
"GD21 



. Id . par" 
. Id . par" 
. Id . par" 
. Id . par" 
. Id . par" 
0 . Id. par" 
. Id . par" 
. Id . par" 
. Id . par" 
. Id . par" 
. Id . par" 
. Id . par" 
. Id . par" 



"GD2 
"GD4 

"GD7 
"GD9 

'GD12 
'GD14 

'GDI 7 
'GDI 9 



, Id . par" 
. Id . par' 

. Id . par' 
. Id . par' 

, Id . par" 
. Id . par" 

. Id . par' 
. Id . par' 



'GD22 . Id. par" 



These fitted models are re-used by specifying their 
names as the argument LDfiles to sim. contol: 

R> ss <- sim. control (simulate=TRUE, 
LDf iles =cibd$simparams $LDf iles ) 

The sim. control () function can also be used to 
specify the relationships to simulate; e.g., one can obtain 
simulated cousin pairs with 

R> ss <- sim. control (simulate=TRUE , rship= 
"cousin" , LDf iles=cibd$simparams$LDf iles ) 

It is also possible to obtain pairs simulated according 
to a user-specified relationship. In the following, the rela- 
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tionship of interest is parent-offspring with first cousins 
parents. The relationship is depicted in Figure 4, which 
was drawn using Pedfiddler [22]. To simulate according to 
this relationship, it is necessary to specify a minimal pedi- 
gree that captures the relationship between the mother 
and daughter and to have the mother and daughter be the 
first two members of the pedigree. The pedigree drawn in 
Figure 4 has parents (nodes 2 and 3) that are first cousins. 
Pedigree information is specified in a data frame whose 
rows describe subjects. The columns of the data frame are 
member IDs, the IDs of each member's father and mother, 
and gender, coded as 1 for male and 2 for female. For pedi- 
gree founders, the father and mother IDs are set to zero. 
Specification of the pedigree in Figure 4 is as follows: 

userdat <- data . frame ( ids=l : 9 , 

dadids=c (3,5,7,0,9,9,0,0,0) , 
momids=c (2,4,6,0,8,8,0,0,0) , 
gender=c (2,2,1,2,1,2,1,2,1) ) 

The call to IBDcheck ( ) would then be: 

ss <- sim . control ( simulate=TRUE , 

rships= "user " , userdat=userdat , 
LDf iles=cibd$simparams$LDf iles) 



8 


9 


4 5/ 


\e 7 


V2 


3' 


■ 




0 


Figure 4 Pedigree for an offspring of a first-cousing mariage. 


Circles represent females, squares represent males. Lines of descent 


are indicated by connections between nodes. The mother and 


daughter of interest are labelled as 2 and 1, respectively. 



ff <- filter. control (filter=FALSE) 
cibd.user <- IBDcheck (cibd, simparams=ss, 
f ilterparams=f f ) 

where the argument filter=FALSE to filter, 
control ( ) reflects the fact that there is no need to re- 
filter the data. On the plot of cibd.user (not shown) 
the prediction ellipse for simulated mother-daughter pairs 
where the daughter is inbred is very similar to that from 
simulated pairs where the daughter is not inbred (Figure 2, 
right panel). However, relative to the non-inbred case, 
the prediction ellipse in the inbred case is shifted slightly 
downward on the plot, reflecting the fact that the proba- 
bility of 2 genes IBD is now non-zero and the probability 
of 1 gene IBD is therefore smaller. 

Conclusions 

CrypticIBDcheck is an R package for exploring cryp- 
tic relatedness in a homogeneous sample of nominally 
unrelated individuals. The main function of the package, 
IBDcheck ( ) , computes estimates of IBD coefficients for 
pairs of study subjects and, optionally, for pairs of sub- 
jects simulated to have one of several known relationships. 
Simulated data for a given relationship are obtained by 
gene drop simulation on a pedigree that captures the rela- 
tionship, with founder haplotypes simulated according to 
an LD model fit to the data. Objects of class IBD returned 
by IBDcheck ( ) are displayed by the plot method of 
the class. Pairs of study subjects whose estimated IBD 
coefficients are consistent with one of the relationships 
requested in the call to IBDcheck ( ) are flagged, either 
automatically or interactively by user mouse-clicks, and 
returned in a data frame. 

The methods implemented in CrypticIBDcheck are 
geared specifically towards exploring cryptic relatedness 
with data from candidate-gene association studies. These 
studies involve a relatively modest number of SNPs which 
are correlated because they are clustered within candi- 
date genes. With a modest number of SNPs, the variability 
in the estimator of IBD coefficients cannot be ignored. 
Hence, reference distributions for true IBD coefficients do 
not adequately represent those for estimated IBD coef- 
ficients. In addition, thinning to an approximately inde- 
pendent and yet informative set of SNPs is not an option. 
Nor is ignoring LD and assuming SNPs are approximately 
independent. As illustrated in the Examples, ignoring LD 
leads to reference clusters that are too tight. 

To provide guidance on the numbers of SNPs for which 
CrypticIBDcheck will be useful, we offer the following 
observations. A lower bound on the number of SNPs 
required is difficult to provide because some marker sets 
are more relationship-informative than others of the same 
size, depending on characteristics such as marker allele 
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frequencies and the patterns of marker linkage disequilib- 
rium (LD). In our experiments with the example data set 
Nhlsim, containing 1249 SNPs from 209 genes, between 
325-350 SNPs from about 60 of the candidate genes 
appears to be adequate for identifying the parent-offspring 
and full sibling pairs that are present. There is no upper 
limit on the number of SNPs that may be used. How- 
ever, the computational time for fitting LD models scales 
approximately linearly with the number of SNPs [16]. For 
the Nhlsim data, it took 40 minutes to fit the LD model. 
For the close relationships that we consider, genome-wide 
data can be reduced to a set of approximately indepen- 
dent SNPs with no loss of resolution. An analysis of 
genotypes from 16,245 approximately independent SNPs 
in the HapMap Luhya sample took about 2 minutes to 
complete. This analysis is described the package vignette 
IBDcheck-HapMap. 

We offer the user complete flexibility with respect to the 
type of relationships and number of pairs of each relation- 
ship to be simulated. Users can choose from a number 
of close relationships built-in to IBDcheck ( ) , or spec- 
ify their own relationships, as illustrated in the section 
Additional gene drops using previously-fit LD models. 

A reviewer has pointed out that it would be useful 
to allow the reference distributions of IBD coefficients, 
represented as ellipses on the graphical displays, to be 
conditional on the patterns of missing genotypes in a 
particular pair of subjects. The intent is to allow for a 
two-stage analysis. In the first stage, potentially related 
subjects are identified using the current implementation 
of reference distributions, which are mixtures over the 
patterns of missing genotypes in all pairs of subjects. In 
the second stage, the reference distributions can be cus- 
tomized to be conditional on the missing data patterns 
of a pair of subjects identified in the first stage. For an 
assessment of whether the pair of interest has a particular 
relationship, distributions conditional on that pair's miss- 
ing data patterns are the most appropriate. We plan to 
implement an option to use a specific missing data pattern 
to generate the reference distributions in a future release 
of the package. 

Additional files 



Additional file 1 : Bias of conditional IBS estimators. This is a PDF file 
that includes a calculation of the bias of the plug-in and unbiased 
estimators of P(l = 0|Z = 0). Bias calculations for estimators of other 
conditional IBS probabilities are similar. 

Additional file 2: Splitting computations over a snow cluster. This is a 

PDF file that provides details of how to split CrypticlBDcheck 

computations across a compute cluster. 
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IBD: Identity by descent; IBD: Identity by state; LD: Linkage disequilibrium; MLE: 
Maximum likelihood estimator; MME: Method of moments estimator; SNP: 
Single nucleotide polymorphism. 
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